Unsupervised Spam Detection by Document Probability Estimation with Maximal Overlap Method
نویسندگان
چکیده
In this paper, we study content-based spam detection for spams that are generated by copying a seed document with some random perturbations. We propose an unsupervised detection algorithm based on an entropy-like measure called document complexity, which reflects how many similar documents exist in the input collection of documents. As the document complexity, however, is an ideal measure like Kolmogorov complexity, we substitute an estimated occurrence probability of each document for its complexity. We also present an efficient algorithm that estimates the probabilities of all documents in the collection in linear time to its total length. Experimental results showed that our algorithm especially works well for word salad spams, which are believed to be difficult to detect automatically.
منابع مشابه
Unsupervised Spam Detection by Document Complexity Estimation
In this paper, we study a content-based spam detection for a specific type of spams called blog and bulletin board spams. We develop an efficient unsupervised algorithm DCE that, detects spam documents from a mixture of spam and non-spam documents using a compression-based similarity measure, called the document complexity. Using suffix trees, the algorithm computes the document complexity for ...
متن کاملBotOnus: an online unsupervised method for Botnet detection
Botnets are recognized as one of the most dangerous threats to the Internet infrastructure. They are used for malicious activities such as launching distributed denial of service attacks, sending spam, and leaking personal information. Existing botnet detection methods produce a number of good ideas, but they are far from complete yet, since most of them cannot detect botnets in an early stage ...
متن کاملUnsupervised Spam Detection Based on String Alienness Measures
We propose an unsupervised method for detecting spam documents from Web page data, based on equivalence relations on strings. We propose 3 measures for quantifying the alienness (i.e. how different it is from others) of substring equivalence classes within a given set of strings. A document is then classified as spam if it contains a characteristic equivalence class as a substring. The proposed...
متن کاملAn Effective Model for SMS Spam Detection Using Content-based Features and Averaged Neural Network
In recent years, there has been considerable interest among people to use short message service (SMS) as one of the essential and straightforward communications services on mobile devices. The increased popularity of this service also increased the number of mobile devices attacks such as SMS spam messages. SMS spam messages constitute a real problem to mobile subscribers; this worries telecomm...
متن کاملMoving dispersion method for statistical anomaly detection in intrusion detection systems
A unified method for statistical anomaly detection in intrusion detection systems is theoretically introduced. It is based on estimating a dispersion measure of numerical or symbolic data on successive moving windows in time and finding the times when a relative change of the dispersion measure is significant. Appropriate dispersion measures, relative differences, moving windows, as well as tec...
متن کامل